56 research outputs found

    Classification Methods for 16S rRNA Based Functional Annotation

    Microbial communities play an essential role in Earth’s ecosystems. The goal of this study was to investigate whether the functional potential of the microorganisms forming these diverse communities can be identified directly from the 16S rRNA marker gene using supervised learning methods. The recently developed FAPROTAX database was used along with the SILVA database to produce a training set in which 16S rRNA sequences are linked to a number of metabolic functions. Since gene sequences cannot be used directly as feature vectors by most classification algorithms, the present research investigated possible feature engineering approaches for 16S rRNA. Techniques based on Multiple Sequence Alignment (MSA) and N-grams are proposed and tested. The results showed that the N-gram feature representation outperformed the MSA-based one, especially for large and diverse functional groups. This suggests that a clustering-like alignment procedure results in a biased feature representation of the marker gene. Since classifiers trained with Random Forest and Support Vector Machine techniques were able to accurately detect a range of functional groups, it is concluded that the 16S rRNA gene provides substantial information for the direct identification of functional capabilities.
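    The N-gram representation described above amounts to counting fixed-length subsequences (k-mers) of the marker gene. A minimal sketch of such a featurization, assuming plain A/C/G/T sequences (the function name and the toy sequence are illustrative, not from the study):

```python
from itertools import product

def kmer_features(seq, k=3):
    """Count occurrences of each k-mer (N-gram) in a DNA sequence."""
    alphabet = "ACGT"
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:          # skip windows with ambiguous bases, e.g. 'N'
            counts[kmer] += 1
    return counts

vec = kmer_features("ACGTACGT", k=3)
# for k = 3 this yields 4**3 = 64 features; "ACG" occurs twice in this toy sequence
```

    The resulting fixed-length count vector can be fed directly to Random Forest or SVM classifiers, with no alignment step required.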

    Comparison of Classifiers Applied to Confocal Scanning Laser Ophthalmoscopy Data

    Objectives: Comparison of classification methods using data from one clinical study. The tuning of hyperparameters is assessed as part of the methods by nested-loop cross-validation. Methods: We assess the ability of 18 statistical and machine learning classifiers to detect glaucoma. The training data set is one case-control study consisting of confocal scanning laser ophthalmoscopy measurement values from 98 glaucoma patients and 98 healthy controls. We compare out-of-bag bootstrap estimates of the classification error by Spearman's rank correlation, Wilcoxon signed rank tests, and box plots of a bootstrap distribution of the estimate. Results: The classification methods with the lowest estimated error rates are random forests (15.4%), support vector machines (15.9%), bundling (16.3% to 17.8%), and penalized discriminant analysis (16.8%). Conclusions: Using nested-loop cross-validation we account for the tuning of hyperparameters and demonstrate the assessment of different classifiers. We recommend a block design of the bootstrap simulation to allow a statistical assessment of the bootstrap estimates of the misclassification error. The results depend on the data of the clinical study and the given size of the bootstrap sample.
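    Nested-loop cross-validation separates hyperparameter tuning (inner loop) from error estimation (outer loop). A self-contained toy sketch with a one-parameter threshold classifier (the data and candidate thresholds are made up for illustration; the study compared 18 real classifiers):

```python
import random

def cv_splits(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def error(thresh, xs, ys, idx):
    """Misclassification rate of the rule 'predict 1 if x > thresh'."""
    return sum((xs[i] > thresh) != ys[i] for i in idx) / len(idx)

# toy 1-D data: class 1 tends to have larger feature values
xs = [0.1, 0.2, 0.3, 0.4, 0.45, 0.9, 1.0, 1.1, 1.2, 1.3] * 10
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1] * 10

outer_errors = []
for train, test in cv_splits(len(xs), 5, seed=1):
    # inner loop: tune the threshold using only the outer-training data
    best = min([0.3, 0.5, 0.7, 1.05],
               key=lambda t: sum(error(t, xs, ys, [train[j] for j in ite])
                                 for _, ite in cv_splits(len(train), 3, seed=2)))
    # outer loop: score the tuned rule on the held-out fold
    outer_errors.append(error(best, xs, ys, test))
nested_cv_error = sum(outer_errors) / len(outer_errors)
```

    Because the hyperparameter is chosen inside each outer-training fold only, the outer estimate is not optimistically biased by the tuning step.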

    Editorial


    Exact change point detection with improved power in small‐sample binomial sequences

    To detect a change in the probability of a sequence of independent binomial random variables, a variety of asymptotic and exact testing procedures have been proposed. Whenever the sample size or the event rate is small, asymptotic approximations of maximally selected test statistics have been shown to be inaccurate. Although exact methods control the type I error rate, they can be overly conservative in these situations due to the discreteness of the test statistics. We extend approaches by Worsley and Halpern to develop a test that is less discrete and thereby more powerful. Building on ideas from binary segmentation, the proposed test utilizes unused information in the binomial sequences to add a new ordering to test statistics that are of equal value. The exact distributions are derived under side conditions that arise in hypothetical segmentation steps and do not depend on the type of test statistic used (e.g., log likelihood ratio, cumulative sum, or Fisher's exact test). Using the proposed exact segmentation procedure, we construct a change point test and prove that it controls the type I error rate at any given nominal level. Furthermore, we prove that the new test is uniformly at least as powerful as Worsley's exact test. In a Monte Carlo simulation study, the gain in power can be remarkable, especially in scenarios with small sample size. Using a clinical database example on pin site infections and an example assessing publication bias in neuropsychiatric drug research, we demonstrate the wide-ranging applicability of the test.
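    A maximally selected statistic scans all possible split points and takes the largest value. A minimal sketch using the log likelihood ratio for a single change point in a Bernoulli sequence (this illustrates the base statistic only, not the paper's tie-breaking segmentation procedure or its exact p-values):

```python
import math

def loglik(successes, n):
    """Bernoulli log-likelihood of a segment at its MLE p = successes / n."""
    if successes in (0, n):
        return 0.0
    p = successes / n
    return successes * math.log(p) + (n - successes) * math.log(1 - p)

def change_point(x):
    """Maximally selected log likelihood ratio over all split points."""
    n, total = len(x), sum(x)
    best_k, best_stat = None, -1.0
    for k in range(1, n):                 # candidate change after position k
        left = sum(x[:k])
        right = total - left
        stat = loglik(left, k) + loglik(right, n - k) - loglik(total, n)
        if stat > best_stat:
            best_k, best_stat = k, stat
    return best_k, best_stat

k_hat, stat = change_point([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
# the scan places the change after the fifth observation
```

    The discreteness problem is visible here: many sequences yield identical values of `stat`, which is exactly the tie structure the proposed segmentation-based ordering resolves.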


    Ensemble Pruning for Glaucoma Detection in an Unbalanced Data Set

    Background: Random forests are successful classifier ensemble methods consisting of typically 100 to 1000 classification trees. Ensemble pruning techniques reduce the computational cost, especially the memory demand, of random forests by reducing the number of trees without relevant loss of performance, or even with increased performance of the sub-ensemble. The application to the early detection of glaucoma, a severe eye disease with low prevalence, based on topographical measurements of the eye background, faces specific challenges. Objectives: We examine the performance of ensemble pruning strategies for glaucoma detection in an unbalanced data situation. Methods: The data set consists of 102 topographical features of the eye background of 254 healthy controls and 55 glaucoma patients. We compare the area under the receiver operating characteristic curve (AUC) and the Brier score, evaluated on the total data set, in the majority class, and in the minority class, of pruned random forest ensembles obtained with strategies based on the prediction accuracy of greedily grown sub-ensembles, on uncertainty-weighted accuracy, and on the similarity between single trees. To validate the findings and to examine the influence of the prevalence of glaucoma in the data set, we additionally perform a simulation study with lower prevalences of glaucoma. Results: In glaucoma classification, all three pruning strategies lead to improved AUC and smaller Brier scores on the total data set with sub-ensembles as small as 30 to 80 trees, compared to the classification results obtained with the full ensemble consisting of 1000 trees. In the simulation study, we were able to show that the prevalence of glaucoma is a critical factor and that lower prevalence decreases the performance of our pruning strategies.
    Conclusions: The memory demand for glaucoma classification in an unbalanced data situation based on random forests could effectively be reduced by the application of pruning strategies, without loss of performance, in a population with increased risk of glaucoma.
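    Accuracy-based greedy pruning can be sketched with simulated trees, each reduced to its vector of predictions on a validation set (the synthetic labels and per-tree accuracy are assumptions; the uncertainty-weighted and similarity-based variants are not shown):

```python
import random

rng = random.Random(0)
y = [rng.randint(0, 1) for _ in range(200)]          # validation labels
# simulate 50 trees, each predicting correctly on roughly 75% of cases
trees = [[yi if rng.random() < 0.75 else 1 - yi for yi in y]
         for _ in range(50)]

def ensemble_error(members):
    """Misclassification rate of a majority vote over member trees."""
    errors = 0
    for i, yi in enumerate(y):
        votes = sum(t[i] for t in members)           # votes for class 1
        errors += (votes * 2 > len(members)) != yi
    return errors / len(y)

# rank trees by individual accuracy, then add greedily while the error drops
ranked = sorted(trees, key=lambda t: sum(p == yi for p, yi in zip(t, y)),
                reverse=True)
pruned = [ranked[0]]
for t in ranked[1:]:
    if ensemble_error(pruned + [t]) < ensemble_error(pruned):
        pruned.append(t)
```

    Because a tree is admitted only when it strictly lowers the ensemble error, the pruned sub-ensemble never performs worse on the validation data than the best single tree, while typically using far fewer trees than the full forest.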

    Bayesian analysis for mixtures of discrete distributions with a non-parametric component

    Bayesian finite mixture modelling is a flexible parametric modelling approach for classification and density fitting. Many areas of application require distinguishing a signal from a noise component. In practice, it is often difficult to justify a specific distribution for the signal component; therefore, the signal distribution is usually further modelled via a mixture of distributions. However, modelling the signal as a mixture of distributions is computationally non-trivial due to the difficulty of justifying the exact number of components to be used and due to the label switching problem. This paper proposes the use of a non-parametric distribution to model the signal component. We consider the case of discrete data and show how this new methodology leads to more accurate parameter estimation and a smaller false non-discovery rate. Moreover, it does not suffer from the label switching problem. We show an application of the method to data generated by ChIP-sequencing experiments.
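    As context, the fully parametric baseline that such noise-plus-signal models start from can be sketched as a two-component Poisson mixture fitted by EM (synthetic counts; the paper instead models the signal component non-parametrically within a Bayesian framework, which this sketch does not attempt):

```python
import math
import random

rng = random.Random(1)

def poisson(lam):
    """Sample from Poisson(lam) via Knuth's multiplication algorithm."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

# synthetic counts: 300 noise observations (rate 1) and 100 signal (rate 8)
data = [poisson(1.0) for _ in range(300)] + [poisson(8.0) for _ in range(100)]

def pois_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

pi, lam0, lam1 = 0.5, 0.5, 5.0            # initial guesses
for _ in range(100):                      # EM iterations
    # E-step: posterior responsibility of the signal component per count
    r = [pi * pois_pmf(x, lam1)
         / (pi * pois_pmf(x, lam1) + (1 - pi) * pois_pmf(x, lam0))
         for x in data]
    # M-step: update the mixing weight and both rates
    pi = sum(r) / len(data)
    lam1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
    lam0 = (sum((1 - ri) * x for ri, x in zip(r, data))
            / sum(1 - ri for ri in r))
```

    The responsibilities `r` play the role of posterior signal probabilities; thresholding them is what makes quantities such as the false non-discovery rate estimable.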

    Optimal trees selection for classification via out-of-bag assessment and sub-bagging

    The effect of training data size on machine learning methods has been well investigated over the past two decades. The predictive performance of tree-based machine learning methods, in general, improves at a decreasing rate as the size of the training data increases. We investigate this in the optimal trees ensemble (OTE), where the method fails to learn from some of the training observations due to internal validation. Modified tree selection methods are thus proposed for OTE to compensate for the loss of training observations in internal validation. In the first method, the corresponding out-of-bag (OOB) observations are used in both the individual and the collective performance assessment of each tree. Trees are ranked based on their individual performance on the OOB observations. A certain number of top-ranked trees is selected, and starting from the most accurate tree, subsequent trees are added one by one, with their impact recorded using the OOB observations left out of the bootstrap sample taken for the tree being added. A tree is selected if it improves the predictive accuracy of the ensemble. In the second approach, trees are grown on random subsets of the training data taken without replacement (known as sub-bagging) instead of bootstrap samples (taken with replacement). The remaining observations from each sample are used in both the individual and the collective assessment of each corresponding tree, as in the first method. Analyses on 21 benchmark datasets and simulation studies show improved performance of the modified methods in comparison to OTE and other state-of-the-art methods.
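    The sub-bagging step of the second approach can be sketched in a few lines: each tree's training subset is drawn without replacement, so its out-of-bag set is exactly the complement (the subset fraction of 0.63 and the sample size are illustrative choices, not taken from the paper):

```python
import random

rng = random.Random(42)
n = 100                                   # number of training observations
indices = list(range(n))

def subbag(frac=0.63):
    """Draw a sub-bagging sample and its out-of-bag complement."""
    in_bag = rng.sample(indices, int(frac * n))    # without replacement
    out_of_bag = [i for i in indices if i not in set(in_bag)]
    return in_bag, out_of_bag

in_bag, oob = subbag()
# in-bag indices are unique (unlike bootstrap sampling) and the OOB set
# is exactly the complement, so every observation contributes to assessment
```

    With bootstrap sampling, by contrast, in-bag indices repeat and roughly a third of the observations land out-of-bag only on average, which is the source of the information loss the modified methods address.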

    Ensemble of Optimal Trees, Random Forest and Random Projection Ensemble Classification

    The predictive performance of a random forest ensemble is highly associated with the strength of the individual trees and their diversity. An ensemble of a small number of accurate and diverse trees, if prediction accuracy is not compromised, also reduces the computational burden. We investigate the idea of integrating trees that are accurate and diverse. For this purpose, we utilize out-of-bag observations as a validation sample from the training bootstrap samples to choose the best trees based on their individual performance, and then assess these trees for diversity using the Brier score on an independent validation sample. Starting from the first best tree, a tree is selected for the final ensemble if its addition to the forest reduces the error of the trees that have already been added. Our approach does not use an implicit dimension reduction for each tree, as random projection ensemble classification does. A total of 35 benchmark problems on classification and regression are used to assess the performance of the proposed method and to compare it with random forest, random projection ensemble, node harvest, support vector machine, kNN, and classification and regression trees (CART). We compute unexplained variances or classification error rates for all the methods on the corresponding data sets. Our experiments reveal that the size of the ensemble is reduced significantly and better results are obtained in most of the cases. Results of a simulation study are also given, in which four tree-style scenarios are considered to generate data sets with several structures.
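    The Brier-score-based diversity check can be sketched as follows: a candidate tree joins the ensemble only if averaging in its predicted probabilities lowers the ensemble's Brier score on a validation sample (the simulated probabilities are an assumption; the paper uses real out-of-bag and independent validation data):

```python
import random

rng = random.Random(3)
y = [rng.randint(0, 1) for _ in range(100)]        # validation labels
# simulate per-tree predicted probabilities of class 1, noisy around the truth
trees = [[yi * 0.7 + rng.uniform(0.0, 0.3) for yi in y] for _ in range(20)]

def brier(members):
    """Brier score of the ensemble's averaged class-1 probabilities."""
    probs = [sum(t[i] for t in members) / len(members) for i in range(len(y))]
    return sum((p - yi) ** 2 for p, yi in zip(probs, y)) / len(y)

# starting from the first tree, keep a candidate only if it helps
selected = [trees[0]]
for t in trees[1:]:
    if brier(selected + [t]) < brier(selected):
        selected.append(t)
```

    Since the Brier score rewards well-calibrated probabilities rather than bare accuracy, a tree can be rejected even when individually accurate if its errors merely duplicate those already present in the ensemble.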